210 ◾ Bioinformatics
across the whole genome. Genes code for proteins that determine traits and biological
functions. RNA-Seq allows investigation of the entire transcriptome through the profiling
of genes so researchers will know which sets of genes are coregulated during the conditions
studied. RNA-Seq is used to study diseases like cancers and patient’s response to therapeu-
tics and other diseases.
The first steps of the workflow of RNA-Seq are similar to the steps of the other NGS
applications as discussed in the previous chapters. The raw data is obtained in FASTQ
files, which can be single end or paired end. The raw data is subjected to quality con-
trol for the trimming of adaptors and removal of duplicate sequences and low-quality
reads. The cleaned raw data is either aligned to a reference genome or a reference tran-
scriptome. The alignment program should be chosen depending on the read length and
splicing awareness. STAR is a good choice for aligning short RNA-Seq reads. However, it
requires sufficient memory and storage space for the indexing of the reference sequence
and read alignment. The alignment information is stored in SAM/BAM files. Programs
like Samtools and PICARD can be used to manipulate and generate alignment statistics
from the BAM files. Quantification of reads aligned to each gene is essential in the RNA-
Seq data analysis because it is the only way to see the gene transcript abundance, which is
an indication for its biological activities. The reads aligned to each gene in BAM files are
counted using reads counting programs like htseq-count and featureCount. In addition to
the BAM files as inputs, these programs require also a reference annotation file in GTF for-
mat. The reads counts produced by the counting programs are stored in a tab-delimited file
containing gene symbols or gene IDs and the corresponding counts of the aligned reads
for each gene. The count file may contain a count column for each sample. Sample counts
in different files can be merged for easy processing. The count data is usually normalized
to be suitable for within-sample gene comparisons and between-sample gene comparisons.
The normalization is made to adjust for the biases arising from the differences in the gene
lengths, GC contents, and library sizes across samples. The normalization methods like
RPKM/FPKM and CPM adjust for the biases arising from the difference in gene lengths;
therefore, they can be used when comparison between the expressions of different genes
within the sample is intended. Other normalization methods like TMM and RLE adjust
for the bias arising from the difference in the library sizes across samples; therefore, they
can be used for between-sample differential expression analysis. The count data generated
by reads counting programs are further analyzed by the differential programs like edgeR,
DESeq2, and cuffdiff. As an example, we used edgeR to demonstrate the steps of the dif-
ferential analysis of the RNA-Seq data.
The differential analysis requires a metadata file that holds the information of the
samples and the study design from which the design matrix is generated. Genes with low
abundance can be filtered out since they do not contribute to the differential analysis. The
normalized count data as response variable and the design matrix as explanatory vari-
ables are used for a GLM fitting. Since the count data is over-dispersed (variance is greater
than the mean), the negative binomial distribution is assumed to fit the RNA-Seq count
data to a log-linear model to estimate the dispersions and other statistics for the differen-
tial gene expression including log-fold change, p-values, and false discovery rates (FDR).